Abstract 



An indexed sequence of strings is a data structure for storing a string sequence that supports random 
access, searching, range counting and analytics operations, both for exact matches and prefix search. String 
sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. 
In these applications each string can appear several times and the order of the strings in the sequence is 
relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract 
interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly 
into higher performance, since more data can fit in fast memory. 

We introduce and study the problem of compressed indexed sequence of strings, representing indexed 
sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while 
preserving provably good performance for the supported operations. 

We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia 
Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet 
Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art 
compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, 
both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space 
occupancy to close to the entropy of the sequence. 

1 Introduction 

Many problems in databases and information retrieval ultimately reduce to storing and indexing sequences of 
strings. Column-oriented databases represent relations by storing individually each column as a sequence; if each 
column is indexed, efficient operations on the relations are possible. XML databases, taxonomies, and word tries 
are represented as labeled trees, and each tree can be mapped to the sequence of its labels in a specific order; 
indexed operations on the sequence enable fast tree navigation. In data analytics query logs and access logs are 
simply sequences of strings; aggregate queries and counting queries can be performed efficiently with specific 
indexes. Textual document search is essentially the problem of representing a text as the sequence of its words, and 
queries locate the occurrences of given words in the text. Even the storage of non-string (for example, numeric) 
data can be often reduced to the storage of strings, as usually the values can be binarized in a natural way. 

Indexed sequence of strings. An indexed sequence of strings is a data structure for storing a string sequence 
that supports random access, searching, range counting and analytics operations, both for exact matches and 
prefix search. Each string can appear several times and the order of the strings in the sequence is relevant. For a 
sequence $S = (s_0, \ldots, s_{n-1})$ of strings, the primitive operations are: 

• Access(pos): retrieve string $s_{pos}$, where $0 \le pos < n$. 

• Rank(s, pos): count the number of occurrences of string s in $(s_0, \ldots, s_{pos-1})$. 

• Select(s, idx): find the position of the idx-th occurrence of s in $(s_0, \ldots, s_{n-1})$. 

By composing these three primitives it is possible to implement other powerful index operations. For example, 
functionality similar to inverted lists can be easily formulated in terms of Select. These primitives can be extended 
to prefixes. 

• RankPrefix(p, pos): count the number of strings in $(s_0, \ldots, s_{pos-1})$ that have prefix p. 

• SelectPrefix(p, idx): find the position of the idx-th string in $(s_0, \ldots, s_{n-1})$ that has prefix p. 

Prefix search operations can be easily formulated in terms of SelectPrefix. Other useful operations are range 
counting and analytics operations, where the above primitives are generalized to a range [pos', pos) of positions, 
hence to $(s_{pos'}, \ldots, s_{pos-1})$. In this way statistically interesting (e.g. frequent) strings in the given range [pos', pos) 
and having a given prefix p can be quickly discovered (see Section 5 for further operations). 

The sequence S can change over time by defining the following operations, for any arbitrary string s (which 
could be previously unseen). 

• Insert(s, pos): update the sequence S as $S' = (s_0, \ldots, s_{pos-1}, s, s_{pos}, \ldots, s_{n-1})$ by inserting s immediately 
before $s_{pos}$. 

• Append(s): update the sequence S as $S' = (s_0, \ldots, s_{n-1}, s)$ by appending s at the end. 

• Delete(pos): update the sequence S as $S' = (s_0, \ldots, s_{pos-1}, s_{pos+1}, \ldots, s_{n-1})$ by deleting $s_{pos}$. 
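
The semantics of the operations listed above can be pinned down with an uncompressed reference model. The sketch below is only an illustration, not the data structure proposed in this paper: it stores the sequence in a plain Python list, every query takes linear time, and the class name `NaiveIndexedSequence` and the 0-based convention for `idx` are assumptions made here.

```python
class NaiveIndexedSequence:
    """Uncompressed reference model of the operations; O(n) time per query."""

    def __init__(self, strings=()):
        self.seq = list(strings)

    def _count(self, pred, pos):
        # Number of strings among s_0 .. s_{pos-1} that satisfy pred.
        return sum(1 for t in self.seq[:pos] if pred(t))

    def _find(self, pred, idx):
        # Position of the idx-th (0-based, an assumption) string satisfying pred.
        for i, t in enumerate(self.seq):
            if pred(t):
                if idx == 0:
                    return i
                idx -= 1
        raise IndexError("not enough matching strings")

    def access(self, pos):
        return self.seq[pos]

    def rank(self, s, pos):
        return self._count(lambda t: t == s, pos)

    def select(self, s, idx):
        return self._find(lambda t: t == s, idx)

    def rank_prefix(self, p, pos):
        return self._count(lambda t: t.startswith(p), pos)

    def select_prefix(self, p, idx):
        return self._find(lambda t: t.startswith(p), idx)

    def insert(self, s, pos):
        self.seq.insert(pos, s)

    def append(self, s):
        self.seq.append(s)

    def delete(self, pos):
        del self.seq[pos]


log = NaiveIndexedSequence(["a.com/x", "b.org/", "a.com/y", "a.com/x"])
print(log.rank_prefix("a.com/", 3))    # 2
print(log.select_prefix("a.com/", 2))  # 3
```

In this model, range counting over [pos', pos) is the difference `rank_prefix(p, pos) - rank_prefix(p, pos_prime)`.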

Motivation. String sequences lie at the core of column-oriented databases, log processing, and other storage and 
query tasks. The prefix operations supported by an indexed sequence of strings arise in many contexts. Here we 
give a few examples to show that the proposed problem is quite natural. In data analytics for query logs and access 
logs, the sequence order is the time order, so that a range of positions [pos', pos) corresponds to a given time frame. 
The accessed URLs, paths (filesystem, network, . . . ) or any kind of hierarchical references are chronologically 
stored as a sequence $(s_0, \ldots, s_{n-1})$ of strings, and a common prefix denotes a common domain or a common folder 
for the given time frame: we can retrieve access statistics using RankPrefix and report the corresponding items by 
iterating SelectPrefix (e.g. "what has been the most accessed domain during winter vacation?"). This has a wide 
array of applications, from intrusion detection and website optimization to database storage of telephone calls. 
Another interesting example arises in web graphs and social networks, where a binary relation is stored as a graph 
among the entities, so that each edge is conceptually a pair of URLs or hierarchical references (URIs). Edges can 
change over time, so we can report what changed in the adjacency list of a given vertex in a given time frame, 
allowing us to produce snapshots on the fly (e.g. "how did friendship links change in that social network during 
winter vacation?"). In the above applications the many strings involved require a suitable compressed format. 
Space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast 
memory. 

Compressed indexed sequence of strings. We introduce and study the problem of compressed indexed sequence 
of strings representing indexed sequences of strings in nearly-optimal compressed space, both in the static and 
dynamic settings, while preserving provably good performance for the supported operations. 

Traditionally, indexed sequences are stored by representing the sequence explicitly and indexing it using 
auxiliary data structures, such as B-Trees, Hash Indexes, Bitmap Indexes. These data structures have excellent 
performance and both external and cache-oblivious variants are well studied [23]. Space efficiency is however 
sacrificed: the total occupancy is several times the space of the sequence alone. In a latency constrained world 
where more and more data have to be kept in internal memory, this is not feasible anymore. 

The field of succinct and compressed data structures comes to aid: there is a vast literature about compressed 
storage of sequences, under the name of Rank/Select sequences [TS]. The existing Rank/Select data structures, 
however, assume that the alphabet from which the sequences are drawn is integer and contiguous, i.e. each 
element of the sequence is just a symbol in $\{1, \ldots, \sigma\}$. Non-integer or non-contiguous alphabets need to be mapped 
first to an integer range. Letting $S_{set}$ denote the set of distinct strings in the sequence $S = (s_0, \ldots, s_{n-1})$, the 
representation of S as a sequence of n integers in $\{1, \ldots, |S_{set}|\}$ requires mapping each $s_i$ to its corresponding 
integer, thus introducing at least two issues: (a) once the mapping is computed, it cannot be changed, which 
means that in dynamic operations the alphabet must be known in advance; (b) for string data the string structure 
is lost, hence no prefix operations can be supported. Issue (a) in particular rules out applications in database 
storage, as the set of values of a column (or even its cardinality) is very rarely known in advance; similarly in text 
indexing a new document can contain unseen words; in URL sequences, new URLs can be created at any moment. 

Wavelet Trie. We introduce a new data structure, the Wavelet Trie, that overcomes the previously mentioned 
issues. The Wavelet Trie is a generalization for string sequences S of the Wavelet Tree [13], a compressed data 
structure for sequences, where the shape of the tree is induced from the structure of the string set $S_{set}$ as in 
the Patricia Trie [19]. This enables efficient prefix operations and the ability to grow or shrink the alphabet as 
values are inserted or removed. We first present a static version of the Wavelet Trie in Section 3. We then give an 
append-only dynamic version of the Wavelet Trie, meaning that elements can be inserted only at the end (the 
typical scenario of query logs and access logs), and a fully dynamic version that is useful for database applications 
(see Section 4). 

Our time bounds are reported in Table 1 and some comments are in order. Recall that S denotes the input 
sequence of strings stored in the Wavelet Trie, and $S_{set}$ is the set of distinct strings in S. For a string s to be 
queried, let $h_s$ denote the number of nodes traversed in the binary Patricia Trie storing $S_{set}$ when s is searched 
for. Observe that $h_s \le |s| \log |\Sigma|$, where $\Sigma$ is the alphabet of symbols from which s is drawn, and $|s| \log |\Sigma|$ is the 
length in bits of s (while $|s|$ denotes its number of symbols as usual). The cost for the queries on the static and 
append-only versions of the Wavelet Trie is $O(|s| + h_s)$ time, which is the same cost as searching in the binary 
Patricia Trie. Surprisingly, the cost of appending s to S is still $O(|s| + h_s)$ time, which means that compressing 
and indexing a sequential log on the fly is very efficient. The cost of the operations for the fully dynamic version 
is also competitive, without the need of knowing the alphabet in advance. This answers positively a question 
posed in [T^] and [18]. 



                 Query                    Append                   Insert                   Delete                     Space (in bits) 
Static           $O(|s| + h_s)$           -                        -                        -                          $LB + o(hn)$ 
Append-only      $O(|s| + h_s)$           $O(|s| + h_s)$           -                        -                          $LB + PT + o(hn)$ 
Fully-dynamic    $O(|s| + h_s \log n)$    $O(|s| + h_s \log n)$    $O(|s| + h_s \log n)$    $O(|s| + h_s \log n)$ †    $LB + PT + O(nH_0)$ 

Table 1: Bounds for the Wavelet Trie. Query is the cost of Access, Rank(Prefix), Select(Prefix), LB is the information 
theoretic lower bound $LT + nH_0$ (Sect. 3), and PT the space taken by the dynamic Patricia Trie (Sect. 4). †Note that 
deletion may take $O(\ell + h_s \log n)$ time when deleting the last occurrence of a string, where $\ell$ is the length of the longest 
string in $S_{set}$. 

All versions are nearly optimal in space, as shown in Table 1. In particular, the lower bound LB(S) for storing 
an indexed sequence of strings can be derived from the lower bound $LT(S_{set})$ for storing $S_{set}$ given in [7] plus 
Shannon's classical zero-order entropy bound $nH_0(S)$ for storing S as a sequence of symbols. The static version uses 
an additional number of bits that is just a lower order term $o(hn)$, where h is the average height of the Wavelet 
Trie (Definition 3.4). The append-only version only adds $PT(S_{set}) = O(|S_{set}| \log n)$ bits for keeping $O(|S_{set}|)$ 
pointers to the dynamically allocated memory (assuming that we do not have control on the memory allocator on 
the machine). The fully dynamic version has a redundancy of $O(nH_0(S))$ bits. 

Results. Summing up the above contributions: we address a new problem on sequences of strings that is meaningful 
in real-life applications; we introduce a new compressed data structure, the Wavelet Trie, and analyze its nearly 
optimal space; we show that the supported operations are competitive with those of uncompressed data structures, 
both in the static and dynamic setting. We have further findings in this paper. In case the prefix operations 
are not needed (for example when the values are numeric), we show in Section 6 how to use a Wavelet Trie to 
maintain a probabilistically balanced Wavelet Tree, hence guaranteeing access times logarithmic in the alphabet 
size. Again, the alphabet does not need to be known in advance. We also present an append-only compressed 
bitvector that supports constant-time Rank, Select, and Append in nearly optimal space. We use this bitvector in 
the append-only Wavelet Trie. 

Related work. While there has been extensive work on compressed representations for sets of strings, to the 
best of our knowledge the problem of representing sequences of strings has not been studied. Indexed sequences of 
strings are usually stored in the following ways: (1) by mapping the strings to integers through a dictionary, the 
problem is reduced to the storage of a sequence of integers; (2) by concatenating the strings with a separator, 
and compressing and full-text indexing the obtained string; (3) by storing the concatenation $(s_i, i)$ in a string 
dictionary such as a B-Tree. 

The approach in (1), used implicitly in [3, 8] and most literature about Rank/Select sequences, sacrifices the 
ability to perform prefix queries. If the mapping preserves the lexicographic ordering, prefixes are mapped to 
contiguous ranges; this enables some prefix operations, by exploiting the two-dimensional nature of the Wavelet 
Tree: RankPrefix can be reduced to the RangeCount operation described in [17]. To the best of our knowledge, 
however, even with a lexicographic mapping there is no way to efficiently support SelectPrefix. More importantly, 
in the dynamic setting it is not possible to change the alphabet (the underlying string set $S_{set}$) without rebuilding 
the tree, as previously discussed. 

The approach in (2), called Dynamic Text Collection in [18], although it allows for potentially more powerful 
operations, is both slower, because it needs a search in the compressed text index, and less space-efficient, as it 
only compresses according to the k-order entropy of the string, failing to exploit the redundancy given by repeated 
strings. 

The approach in (3), used often in databases to implement indexes, only supports Select, while another copy 
of the sequence is still needed to support Access, and it does not support Rank. Furthermore, it offers little or no 
guaranteed compression ratio. 

2 Preliminaries 

Information-theoretic lower bounds. We assume that all the logarithms are in base 2, and that the word 
size is $w \ge \log n$ bits. Let $s = c_1 \ldots c_n \in \Sigma^*$ be a sequence of length $|s| = n$, drawn from an alphabet $\Sigma$. The 
binary representation of s is a binary sequence of $n \lceil \log |\Sigma| \rceil$ bits, where each symbol $c_i$ is replaced by the 
$\lceil \log |\Sigma| \rceil$ bits encoding it. The zero-order empirical entropy of s is defined as 
$H_0(s) = -\sum_{c \in \Sigma} \frac{n_c}{n} \log \frac{n_c}{n}$, where $n_c$ is the number of occurrences of symbol c in s. 
Note that $nH_0(s) \le n \log |\Sigma|$ is a lower bound on the bits needed to 
represent s with an encoder that does not exploit context information. If s is a binary sequence with $\Sigma = \{0, 1\}$ 
and p is the fraction of 1s in s, we can rewrite the entropy as $H_0(s) = -p \log p - (1 - p) \log(1 - p)$, which we also 
denote by H(p). We use B(m, n) as a shorthand for $\lceil \log \binom{n}{m} \rceil$, the information-theoretic lower bound in bits for 
storing a set of m elements drawn from a universe of size n. We implicitly make extensive use of the bounds 
$B(m, n) \le nH(m/n) + O(1)$ and $B(m, n) \le m \log(n/m) + O(m)$. 
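
The quantities defined above are easy to evaluate numerically. The following sketch is an illustration only: the function names `h0`, `binary_entropy`, and `B` are chosen here, and the constant `2 * m + 1` in the last check is one concrete choice standing in for the $O(m)$ term, not a value taken from the text.

```python
from collections import Counter
from math import ceil, comb, log2


def h0(seq):
    """Zero-order empirical entropy H_0 of seq, in bits per symbol."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())


def binary_entropy(p):
    """H(p) for a binary sequence whose fraction of 1s is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def B(m, n):
    """ceil(log binom(n, m)): bits needed to store m elements out of a universe of n."""
    return ceil(log2(comb(n, m)))


s = "abracadabra"
n, sigma = len(s), len(set(s))
assert n * h0(s) <= n * log2(sigma)          # nH_0(s) <= n log|Sigma|

m, n = 4, 11
assert B(m, n) <= n * binary_entropy(m / n) + 1
assert B(m, n) <= m * log2(n / m) + 2 * m + 1
```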

Bitvectors and FIDs. Binary sequences, i.e. $\Sigma = \{0, 1\}$, are also called bitvectors, and data structures that 
encode a bitvector while supporting Access/Rank/Select are also called Fully Indexed Dictionaries, or FIDs 
[22]. The representation of [22], referred to as RRR, can encode a bitvector with n bits, of which m are 1s, in 
$B(m, n) + O((n \log \log n) / \log n)$ bits, while supporting all the operations in constant time. 

Wavelet Trees. The Wavelet Tree, introduced in [13], is the first data structure to extend Rank/Select operations 
from bitvectors to sequences on an arbitrary alphabet $\Sigma$, while keeping the sequence compressed. Wavelet Trees 
reduce the problem to the storage of a set of $|\Sigma| - 1$ bitvectors organized in a tree structure. 

The alphabet is recursively partitioned in two subsets, until each subset is a singleton (hence the leaves are in 
one-to-one correspondence with the symbols of $\Sigma$). The bitvector $\beta$ at the root has one bit for each element of the 
sequence, where $\beta_i$ is 0/1 if the i-th element belongs to the left/right subset of the alphabet. The sequence is 
then projected on the two subsets, obtaining two subsequences, and the process is repeated on the left and right 
subtrees. An example is shown in Figure 1. 

Note that the 0s of one node are in one-to-one correspondence with the bits of the left node, while the 1s are in 
correspondence with the bits of the right node, and the correspondence is given downwards by Rank and upwards 
by Select. Thanks to this mapping, it is possible to perform Access and Rank by traversing the tree top-down, 
and Select by traversing it bottom-up. 

By using RRR bitvectors, the space is $nH_0(S) + o(n \log |\Sigma|)$ bits, while operations take $O(\log |\Sigma|)$ time. 
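
The construction and the three traversals can be sketched as follows. This is a minimal illustration under stated assumptions, not the compressed structure of [13]: bitvectors are plain Python lists with linear-time rank and select (a stand-in for RRR), the alphabet is split at the midpoint of its sorted order, and the class name `WaveletTree` and the 0-based `idx` are choices made here. Queried symbols are assumed to belong to the alphabet.

```python
class WaveletTree:
    """Pointer-based Wavelet Tree; plain lists stand in for RRR bitvectors."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            self.left = set(self.alphabet[:mid])
            # Bit i is 0/1 if the i-th element falls in the left/right subset.
            self.bits = [0 if c in self.left else 1 for c in seq]
            self.children = (
                WaveletTree([c for c in seq if c in self.left], self.alphabet[:mid]),
                WaveletTree([c for c in seq if c not in self.left], self.alphabet[mid:]),
            )

    def _rank_bit(self, b, pos):
        return self.bits[:pos].count(b)

    def _select_bit(self, b, idx):
        for i, x in enumerate(self.bits):
            if x == b:
                if idx == 0:
                    return i
                idx -= 1
        raise IndexError("not enough occurrences")

    def access(self, pos):
        # Top-down: follow the bit at pos, mapping the position with Rank.
        if len(self.alphabet) == 1:
            return self.alphabet[0]
        b = self.bits[pos]
        return self.children[b].access(self._rank_bit(b, pos))

    def rank(self, c, pos):
        # Top-down: the position reached at the leaf is the answer.
        if len(self.alphabet) == 1:
            return pos
        b = 0 if c in self.left else 1
        return self.children[b].rank(c, self._rank_bit(b, pos))

    def select(self, c, idx):
        # Bottom-up: positions are mapped upwards with Select as the recursion unwinds.
        if len(self.alphabet) == 1:
            return idx
        b = 0 if c in self.left else 1
        return self._select_bit(b, self.children[b].select(c, idx))


wt = WaveletTree("abracadabra")
print("".join(map(str, wt.bits)))  # 00101010010
print(wt.rank("a", 11))            # 5
print(wt.select("r", 1))           # 9
```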

Patricia Tries. The Patricia Trie [19] (or compacted binary trie) of a non-empty prefix-free set of binary strings 
is a binary tree built recursively as follows. (i) The Patricia Trie of a single string is a node labeled with the 
string. (ii) For a nonempty string set S, let $\alpha$ be the longest common prefix of S (possibly the empty string). Let 
$S_b = \{\gamma \mid \alpha b \gamma \in S\}$ for $b \in \{0, 1\}$. Then the Patricia Trie of S is the tree whose root is labeled with $\alpha$ and whose 
children (respectively labeled with 0 and 1) are the Patricia Tries of the sets $S_0$ and $S_1$. Unless otherwise specified, 
we use trie to indicate a Patricia Trie, and we focus on binary strings. 
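
The recursive definition translates almost literally into code. The sketch below is an illustration: the function name `patricia` and the tuple encoding of nodes, `(label, children)` with `children` equal to `None` at leaves, are assumptions made here, and the input is assumed to be a non-empty prefix-free set of strings over the characters 0 and 1.

```python
from os.path import commonprefix


def patricia(strings):
    """Patricia Trie of a non-empty prefix-free set of binary strings.

    Returns (alpha, children); children is None at a leaf, otherwise the pair
    (trie of S_0, trie of S_1).
    """
    strings = sorted(set(strings))
    alpha = commonprefix(strings)        # longest common prefix, character-wise
    if len(strings) == 1:
        return (alpha, None)
    parts = ([], [])
    for s in strings:
        # Prefix-freeness guarantees that every string is longer than alpha here.
        parts[int(s[len(alpha)])].append(s[len(alpha) + 1:])
    return (alpha, (patricia(parts[0]), patricia(parts[1])))


def count_nodes(trie):
    _, children = trie
    return 1 if children is None else 1 + sum(count_nodes(c) for c in children)


trie = patricia({"0001", "0011", "0100", "00100"})
print(trie[0])            # 0
print(count_nodes(trie))  # 7
```

A set of k strings yields k leaves and k - 1 internal nodes, hence the 7 nodes for the 4 strings above.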

3 The Wavelet Trie 

We informally define the Wavelet Trie of a sequence of binary strings S as a Wavelet Tree on S (seen as a sequence 
on the alphabet $\Sigma = S_{set}$) whose tree structure is given by the Patricia Trie of $S_{set}$. We focus on binary strings 
without loss of generality, since strings from larger alphabets can be binarized as described in Section 2. Likewise, 
we can assume that $S_{set}$ is prefix-free, as any set of strings can be made prefix-free by appending a terminator 
symbol to each string. 

Figure 1: A Wavelet Tree for the input sequence abracadabra from the alphabet {a, b, c, d, r}. The root stores the 
bitvector 00101010010; its left child, for {a, b}, stores 0100010 over abaaaba; its right child, for {c, d, r}, stores 
1011 over rcdr; the node for {d, r} stores 101 over rdr. 

Figure 2: The Wavelet Trie of the sequence of strings (0001, 0011, 0100, 00100, 0100, 00100, 0100). The root stores 
$\beta$ = 0010101; the other two internal nodes store $\beta$ = 0111 and $\beta$ = 100. 

A formal definition of the Wavelet Trie can be given along the lines of the Patricia Trie definition of Section 2. 

Definition 3.1 Let S be a non-empty sequence of binary strings, $S = (s_0, \ldots, s_{n-1})$, $s_i \in \{0, 1\}^*$, whose 
underlying string set $S_{set}$ is prefix-free. The Wavelet Trie of S, denoted WT(S), is built recursively as follows: 

(i) If the sequence is constant, i.e. $s_i = \alpha$ for all i, the Wavelet Trie is a node labeled with $\alpha$. 

(ii) Otherwise, let $\alpha$ be the longest common prefix of S. For any $0 \le i < n$ we can write $s_i = \alpha b_i \gamma_i$, where $b_i$ 
is a single bit. For $b \in \{0, 1\}$ we can then define two sequences $S_b = (\gamma_i \mid b_i = b)$, and the bitvector $\beta = (b_i)$; in 
other words, S is partitioned in the two subsequences depending on whether the string begins with $\alpha 0$ or $\alpha 1$, 
the remaining suffixes form the two sequences $S_0$ and $S_1$, and the bitvector $\beta$ discriminates whether the suffix $\gamma_i$ 
is in $S_0$ or $S_1$. Then the Wavelet Trie of S is the tree whose root is labeled with $\alpha$ and $\beta$, and whose children 
(respectively labeled with 0 and 1) are the Wavelet Tries of the sequences $S_0$ and $S_1$. 
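
Definition 3.1 can be followed step by step in code. The sketch below is an illustration under assumptions made here: the names `Node` and `build` are hypothetical, strings are Python `str` values over the characters 0 and 1, and bitvectors are uncompressed lists. Run on the sequence of Figure 2, it yields the bitvectors 0010101, 0111 and 100.

```python
from os.path import commonprefix


class Node:
    """Wavelet Trie node: label alpha, plus beta and two children if internal."""

    def __init__(self, alpha, beta=None, children=None):
        self.alpha, self.beta, self.children = alpha, beta, children


def build(seq):
    """WT(seq) for a non-empty list of binary strings whose set is prefix-free."""
    alpha = commonprefix(seq)
    if all(s == alpha for s in seq):          # case (i): constant sequence
        return Node(alpha)
    # Case (ii): s_i = alpha b_i gamma_i; beta collects the bits b_i.
    beta = [int(s[len(alpha)]) for s in seq]
    subs = ([], [])
    for s, b in zip(seq, beta):
        subs[b].append(s[len(alpha) + 1:])    # gamma_i goes to S_0 or S_1
    return Node(alpha, beta, (build(subs[0]), build(subs[1])))


root = build(["0001", "0011", "0100", "00100", "0100", "00100", "0100"])
print(root.alpha, root.beta)    # 0 [0, 0, 1, 0, 1, 0, 1]
print(root.children[0].beta)    # [0, 1, 1, 1]
print(root.children[1].alpha)   # 00
```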

An example is shown in Fig. 2. Note that leaves are labeled only with the common prefix $\alpha$ while internal 
nodes are labeled both with $\alpha$ and the bitvector $\beta$. The Wavelet Trie is a generalization of the Wavelet Tree on S: 
each node splits the underlying string set $S_{set}$ in two subsets and a bitvector is used to tell which elements of the 
sequence belong to which subset. Using the same algorithms in [13] we obtain the following. 

Lemma 3.2 The Wavelet Trie supports Access, Rank, and Select operations. In particular, if $h_s$ is the number of 
internal nodes in the root-to-node path representing s in WT(S), Access(pos) performs $O(h_s)$ Rank operations on 
the bitvectors, where s is the resulting string; Rank(s, pos) performs $O(h_s)$ Rank operations on the bitvectors; 
Select(s, idx) performs $O(h_s)$ Select operations on the bitvectors. 

It is interesting to note that any Wavelet Tree can be seen as a Wavelet Trie through a specific mapping of 
the alphabet to binary strings. For example the classic balanced Wavelet Tree can be obtained by mapping each 
element of the alphabet to a distinct string of $\lceil \log \sigma \rceil$ bits; another popular variant is the Huffman-tree shaped 
Wavelet Tree, which can be obtained as a Wavelet Trie by mapping each symbol to its Huffman code. 



Prefix operations. It follows immediately from Definition 3.1 that for any prefix p occurring in at least one 
element of the sequence, the subsequence of strings starting with p is represented by a subtree of WT(S). 

This simple property allows us to support two new operations, RankPrefix and SelectPrefix, as defined in the 
introduction. The implementation is identical to Rank and Select, with the following modifications: if $n_p$ is the 
node obtained by prefix-searching p in the trie, for RankPrefix the top-down traversal stops at $n_p$; for SelectPrefix 
the bottom-up traversal starts at $n_p$. This proves the following lemma. 

Lemma 3.3 Let p be a prefix occurring in the sequence S. Then RankPrefix(p, pos) performs $O(h_p)$ Rank 
operations on the bitvectors, and SelectPrefix(p, idx) performs $O(h_p)$ Select operations on the bitvectors. 

Note that, since $S_{set}$ is prefix-free, Rank and Select on any string in $S_{set}$ are equivalent to RankPrefix and 
SelectPrefix, hence it is sufficient to implement these two operations. 
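
The traversals behind Lemmas 3.2 and 3.3 can be sketched as follows. This is an illustration under assumptions made here, not the succinct representation: nodes are Python dicts, bitvectors are plain lists with linear-time rank and select, `idx` is 0-based, and all function names are hypothetical. The helper `prefix_search` returns the bits branched on along the way to the node $n_p$.

```python
from os.path import commonprefix


def build(seq):
    """Wavelet Trie node as a dict; leaves have beta None (Definition 3.1)."""
    alpha = commonprefix(seq)
    if all(s == alpha for s in seq):
        return {"alpha": alpha, "beta": None, "kids": None}
    beta = [int(s[len(alpha)]) for s in seq]
    kids = tuple(build([s[len(alpha) + 1:] for s, x in zip(seq, beta) if x == b])
                 for b in (0, 1))
    return {"alpha": alpha, "beta": beta, "kids": kids}


def rank_bit(beta, b, pos):
    return beta[:pos].count(b)


def select_bit(beta, b, idx):
    for i, x in enumerate(beta):
        if x == b:
            if idx == 0:
                return i
            idx -= 1
    raise IndexError("not enough occurrences")


def access(node, pos):
    out = node["alpha"]
    while node["beta"] is not None:          # top-down, one Rank per internal node
        b = node["beta"][pos]
        pos = rank_bit(node["beta"], b, pos)
        node = node["kids"][b]
        out += str(b) + node["alpha"]
    return out


def prefix_search(node, p):
    """List of (beta, bit) pairs on the path to n_p, or None if p does not occur."""
    path = []
    while True:
        a = node["alpha"]
        if len(p) <= len(a):
            return path if a.startswith(p) else None
        if not p.startswith(a) or node["beta"] is None:
            return None
        b = int(p[len(a)])
        path.append((node["beta"], b))
        node, p = node["kids"][b], p[len(a) + 1:]


def rank_prefix(root, p, pos):
    path = prefix_search(root, p)
    if path is None:
        return 0
    for beta, b in path:                     # top-down, stopping at n_p
        pos = rank_bit(beta, b, pos)
    return pos


def select_prefix(root, p, idx):
    path = prefix_search(root, p)
    if path is None:
        raise KeyError(p)
    for beta, b in reversed(path):           # bottom-up, starting at n_p
        idx = select_bit(beta, b, idx)
    return idx


S = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
root = build(S)
print(access(root, 3))               # 00100
print(rank_prefix(root, "00", 7))    # 4
print(select_prefix(root, "001", 1)) # 3
```

Since the string set is prefix-free, calling `rank_prefix` or `select_prefix` with a full string gives exact-match Rank and Select.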

Average height. To analyze the space occupied by the Wavelet Trie, we define the average height. 

Definition 3.4 The average height h of a WT(S) is defined as $h = \frac{1}{n} \sum_{i=0}^{n-1} h_{s_i}$. 

Note that the average is taken on the sequence, not on the set of distinct values. Hence we have $hn \le \sum_{i=0}^{n-1} |s_i|$ 
(i.e. the total input size), but we expect $hn \ll \sum_{i=0}^{n-1} |s_i|$ in real situations, for example if short strings are more 
frequent than long strings, or they have long prefixes in common (exploiting the path compression of the Patricia 
Trie). The quantity hn is equal to the sum of the lengths of all the bitvectors $\beta$, since each string $s_i$ contributes 
exactly one bit to all the internal nodes in its root-to-leaf path. Also, the root-to-leaf paths form a prefix-free 
encoding for $S_{set}$, and their concatenation for each element of S is an order-zero encoding for S, thus it cannot be 
smaller than the zero-order entropy of S, as summarized in the following lemma. 

Lemma 3.5 Let h be the average height of WT(S). Then $H_0(S) \le h \le \frac{1}{n} \sum_{i=0}^{n-1} |s_i|$. 
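
As a numeric check of Lemma 3.5 on the sequence of Figure 2, the sketch below computes hn as the total length of the bitvectors, without materializing the trie. The function names are chosen here for illustration.

```python
from collections import Counter
from math import log2
from os.path import commonprefix


def total_bitvector_length(seq):
    """h * n: the sum of the lengths of all bitvectors beta in WT(seq)."""
    alpha = commonprefix(seq)
    if all(s == alpha for s in seq):
        return 0                              # a leaf stores no bitvector
    parts = ([], [])
    for s in seq:
        parts[int(s[len(alpha)])].append(s[len(alpha) + 1:])
    # This node stores one bit per element; recurse on S_0 and S_1.
    return len(seq) + total_bitvector_length(parts[0]) + total_bitvector_length(parts[1])


def h0(seq):
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())


S = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
n = len(S)
hn = total_bitvector_length(S)
h = hn / n
avg_len = sum(len(s) for s in S) / n
print(hn, h)                                  # 14 2.0
assert h0(S) <= h <= avg_len                  # Lemma 3.5
```

Here the three bitvectors have lengths 7, 4 and 3, so hn = 14, well below the total input size of 30 bits.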

Static succinct representation. Our first representation of the Wavelet Trie is static. We show how by using 
suitable succinct data structures the space can be made very close to the information theoretic lower bound. 

To store the Wavelet Trie we need to store its two components: the underlying Patricia Trie and the bitvectors 
in the internal nodes. 

We represent the trie using a DFUDS [2] encoding, which encodes a tree with k nodes in $2k + o(k)$ bits, while 
supporting navigational operations in constant time. Since the internal nodes in the tree underlying the Patricia 
Trie have exactly two children, we compute the corresponding tree in the first-child/next-sibling representation. 
This brings down the number of nodes from $2|S_{set}| - 1$ to $|S_{set}|$, while preserving the operations. Hence we can 
encode the tree structure in $2|S_{set}| + o(|S_{set}|)$ bits. If we denote the number of trie edges as $e = 2(|S_{set}| - 1)$, the 
space can be written as $e + o(|S_{set}|)$. 

The e labels $\alpha$ of the nodes are concatenated in depth-first order in a single bitvector L. We use the partial 
sum data structure of [22] to delimit the labels in L. This adds $B(e, |L| + e) + o(|S_{set}|)$ bits. The total space (in 
bits) occupied by the trie structure is hence $|L| + e + B(e, |L| + e) + o(|S_{set}|)$. 

We now recast the lower bound in [7] using our notation, specializing it for the case of binary strings. 

Theorem 3.6 ([7]) For a prefix-free string set $S_{set}$, the information-theoretic lower bound $LT(S_{set})$ for encoding 
$S_{set}$ is given by $LT(S_{set}) = |L| + e + B(e, |L| + e)$, where L is the bitvector containing the e labels $\alpha$ of the nodes 
concatenated in depth-first order. 

It follows immediately that the trie space is just the lower bound LT plus a negligible overhead. 

It remains to encode the bitvectors $\beta$. We use the RRR encoding, which takes $|\beta| H_0(\beta) + o(|\beta|)$ bits to compress 
the bitvector $\beta$ and supports constant-time Rank/Select operations. In [13] it is shown that, regardless of the 
shape of the tree, the entropies of the bitvectors $\beta$ add up to the total entropy of the sequence, 
$nH_0(S)$, plus negligible terms. 

With respect to the redundancy beyond $nH_0(S)$, however, we cannot assume that $|S_{set}| = o(n)$ and that 
the tree is balanced, as in [13] and most Wavelet Tree literature; in our applications, it is well possible that 
$|S_{set}| = \Theta(n)$, so a more careful analysis is needed. In Appendix A, Lemma A.4, we show that in the general case 
the redundancy adds up to $o(hn)$ bits. 

We concatenate the RRR encodings of the bitvectors, and use again the partial sum structure of [22] to delimit 
the encodings, with an additional space occupancy of $o(hn)$. The bound is proven in Appendix A. 
Overall, the set of bitvectors occupies $nH_0(S) + o(hn)$ bits. 

All the operations can be supported with a trie traversal, which takes $O(|s|)$ time, and $O(h_s)$ Rank/Select 
operations on the bitvectors. Since the bitvector operations are constant time, all the operations take $O(|s| + h_s)$ 
time. Putting together these observations, we obtain the following theorem. 

Theorem 3.7 The Wavelet Trie WT(S) of a sequence of binary strings S can be encoded in $LT(S_{set}) + nH_0(S) + o(hn)$ 
bits, while supporting the operations Access, Rank, Select, RankPrefix, and SelectPrefix on a string s in 
$O(|s| + h_s)$ time. 



Note that when the tree is balanced both time and space bounds are basically equivalent to those of the 
standard Wavelet Tree. We remark that the space upper bound in Theorem 3.7 is just the information theoretic 
lower bound $LB(S) = LT(S_{set}) + nH_0(S)$ plus an overhead negligible in the input size. 

4 Dynamic Wavelet Tries 

In this section we show how to implement dynamic updates to the Wavelet Trie, resulting in the first compressed 
dynamic sequence with dynamic alphabet. This is the main contribution of the paper. 

Dynamic variants of Wavelet Trees have been presented recently [THl HH HH]. They all assume that the alphabet 
is known a priori, hence the tree structure is static. Under this assumption it is sufficient to replace the bitvectors in 
the nodes with dynamic bitvectors with indels, i.e. bitvectors that support the insertion or deletion of bits at arbitrary 
points. Insertion at position pos can be performed by inserting 0 or 1 at position pos of the root, depending on whether the leaf 
corresponding to the value to be inserted is in the left or right subtree. A Rank operation is used to find the new 
position pos' in the corresponding child. The algorithm proceeds recursively until a leaf is reached. Deletion is 
symmetric. 

The same operations can be implemented on a Wavelet Trie. The novelty consists in the ability of inserting 
strings that do not already occur in the sequence, and of deleting the last occurrence of a string, in both cases 
changing the alphabet S se t and thus the shape of the tree. To do so we represent the underlying tree structure of 
the Wavelet Trie with a dynamic Patricia Trie. We summarize the properties of a dynamic Patricia Trie in the 
following lemma. The operations are standard, but we describe them in Appendix [B] for completeness. 

Lemma 4.1 A dynamic Patricia Trie on k binary strings occupies $O(kw) + |L|$ bits, where L is defined as in 
Theorem 3.6. Besides the standard traversal operations in constant time, insertion of a new string s takes $O(|s|)$ 
time. Deletion of a string s takes $O(\ell)$ time, where $\ell$ is the length of the longest string in the trie. 

Updating the bitvectors. Each internal node of the trie is augmented with a bitvector $\beta$, as in the static 
Wavelet Trie. Inserting and deleting a string induce the following changes on the bitvectors $\beta$. 

Insert(s, pos): If the string is not present, we insert it into the Patricia Trie, causing the split of an existing 
node: a new internal node and a new leaf are added. We initialize the bitvector in the new internal node as a 
constant sequence of bits b if the split node is a b-labeled child of the new node; the length of the new bitvector is 
equal to the length of the sequence represented by the split node (i.e. the number of b bits in the parent node if 
the split node is a b-labeled child). The algorithm then follows as if the string was in the trie. This operation is 
shown in Figure 3. Now we can assume the string is in the trie. Let prefix $\alpha$ and bitvector $\beta$ be the labels in the 
root. Since the string is in the trie, it must be in the form $\alpha b \gamma$, where b is a bit. We insert b at position pos in $\beta$ 
and compute pos' = Rank(b, pos) in $\beta$, and insert recursively $\gamma$ in the b-labeled subtree of the root at position 
pos'. We proceed until we reach a leaf. 

Delete(pos): Let $\beta$ be the bitvector in the root. We first find the bit corresponding to position pos in the 
bitvector, b = Access(pos) in $\beta$. Then we compute pos' = Rank(b, pos) in $\beta$, and delete recursively the string at 
position pos' from the b-labeled subtree. We then delete the bit at position pos from $\beta$. We then check if the 
parent of the leaf node representing the string has a constant bitvector; in this case the string deleted was the last 
occurrence in the sequence. We can then delete the string from the Patricia Trie, thus deleting an internal node 
(whose bitvector is now constant) and a leaf. 
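
The Insert and Delete procedures above, including the node split and the removal of the last occurrence of a string, can be sketched as follows. This is only an illustration under assumptions made here: bitvectors are plain Python lists, so every bitvector operation (including the constant fill used for the split) takes linear time rather than the bounds discussed in this section, and the names `Node`, `DynamicWaveletTrie`, `_ensure` and `_delete` are hypothetical. The set of stored strings is assumed to stay prefix-free.

```python
from os.path import commonprefix


class Node:
    def __init__(self, alpha, beta=None, kids=None):
        self.alpha, self.beta, self.kids = alpha, beta, kids   # leaf: beta is None


def _ensure(node, s, count):
    """Make sure s is in the trie rooted at node, which represents count elements."""
    a = node.alpha
    k = len(commonprefix([a, s]))
    if k < len(a):                       # s diverges inside the label: split the node
        old_bit = int(a[k])
        node.alpha = a[k + 1:]
        kids = [None, None]
        kids[old_bit], kids[1 - old_bit] = node, Node(s[k + 1:])
        return Node(a[:k], [old_bit] * count, kids)   # constant bitvector of length count
    if node.beta is not None:            # label matched: continue in the child
        b = int(s[k])
        node.kids[b] = _ensure(node.kids[b], s[k + 1:], node.beta.count(b))
    return node


def _delete(node, pos):
    if node.beta is None:
        return node
    b = node.beta[pos]
    node.kids[b] = _delete(node.kids[b], node.beta[:pos].count(b))
    del node.beta[pos]
    if b not in node.beta:               # bitvector is now constant: last occurrence gone
        other = node.kids[1 - b]         # merge this node with the surviving child
        other.alpha = node.alpha + str(1 - b) + other.alpha
        return other
    return node


class DynamicWaveletTrie:
    def __init__(self):
        self.root, self.n = None, 0

    def insert(self, s, pos):
        if self.root is None:
            self.root = Node(s)
        else:
            self.root = _ensure(self.root, s, self.n)
            node = self.root
            while node.beta is not None:
                b = int(s[len(node.alpha)])
                new_pos = node.beta[:pos].count(b)     # pos' = Rank(b, pos)
                node.beta.insert(pos, b)
                s, pos, node = s[len(node.alpha) + 1:], new_pos, node.kids[b]
        self.n += 1

    def delete(self, pos):
        self.n -= 1
        self.root = None if self.n == 0 else _delete(self.root, pos)

    def access(self, pos):
        node, out = self.root, self.root.alpha
        while node.beta is not None:
            b = node.beta[pos]
            pos = node.beta[:pos].count(b)
            node = node.kids[b]
            out += str(b) + node.alpha
        return out


wt = DynamicWaveletTrie()
for s, pos in [("0100", 0), ("0001", 0), ("00100", 1)]:
    wt.insert(s, pos)
print([wt.access(i) for i in range(wt.n)])   # ['0001', '00100', '0100']
```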

In both cases the number of operations (Rank, Insert, Delete) on the bitvectors is bounded by $O(h_s)$. The 
operations we need to perform on the bitvectors are the standard insertion/deletion, with one important exception: 
when a node is split, we need to create a new constant bitvector of arbitrary length. We call this operation 
Init(b, n), which fills an empty bitvector with n copies of the bit b. The following remark rules out for our purposes 
most existing dynamic bitvector constructions. 

Remark 4.2 If the encoding of a constant (i.e. $0^n$ or $1^n$) bitvector uses $\omega(f(n))$ memory words (of size w), 
Init(b, n) cannot be supported in $O(f(n))$ time. 

Uncompressed bitvectors use $\Omega(n/w)$ words; the compressed bitvectors of |18l H2], although they have a 
desirable occupancy of $|\beta| H_0(\beta) + o(|\beta|)$, have $\Omega(n \log \log n / (w \log n))$ words of redundancy. Since we aim for 
polylog operations, these constructions cannot be considered as is. 
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Figure 3: Insertion of the new string s = . . . 7IA at position 3. An existing node is split by adding a new internal node 
with a constant bitvector and a new leaf. The corresponding bits are then inserted in the root-to-leaf path nodes. 



Main results. We first consider the case of append-only sequences. In the Insert operation
described above, when a string is appended at the end of the sequence, the bits inserted in the bitvectors are
also appended at the end, so it is sufficient that the bitvectors support an Append operation in place of a general
Insert. Furthermore, Init can be implemented simply by adding a left offset to each bitvector, which increases
the space of each bitvector by $O(\log n)$ and can be checked in constant time. Using the append-only bitvectors described
in Section 4.1, and observing that the redundancy is as in Section 3, we can state the following theorem.

Theorem 4.3 The append-only Wavelet Trie on a dynamic sequence $S$ supports the operations Access, Rank,
Select, RankPrefix, SelectPrefix, and Append in $O(|s| + h_s)$ time. The total space occupancy is
$O(|S_{set}| w) + |L| + nH_0(S) + o(hn)$ bits, where $L$ is defined as in Theorem 3.6.

Using instead the fully dynamic bitvectors of Section 4.2, we can state the following theorem.

Theorem 4.4 The dynamic Wavelet Trie on a dynamic sequence $S$ supports the operations Access, Rank, Select,
RankPrefix, SelectPrefix, and Insert in $O(|s| + h_s \log n)$ time. Delete is supported in $O(|s| + h_s \log n)$ time if $s$
occurs more than once; otherwise the time is $O(\ell + h_s \log n)$, where $\ell$ is the length of the longest string. The total
space occupancy is $O(nH_0(S) + |S_{set}| w) + |L|$ bits, where $L$ is defined as in Theorem 3.6.

Note that, using the compact notation defined in the introduction, the space bound in Theorem 4.3 can be
written as $LB(S) + PT(S_{set}) + o(hn)$, while the one in Theorem 4.4 can be written as $LB(S) + PT(S_{set}) + O(nH_0)$.

4.1 Append-only bitvectors

In this section we describe an append-only bitvector with constant-time Rank/Select/Append operations and
nearly optimal space occupancy. The data structure uses RRR as a black-box data structure, assuming only its
query time and space guarantees. We require the following decomposable property of RRR: given an input bitvector
of $n$ bits packed into $O(n/w)$ words of size $w \ge \log n$, RRR can be built in $O(n'/\log n)$ time for any chunk of
$n' \ge \log n$ consecutive bits of the input bitvector, using table lookups and the Four-Russians trick; moreover, this
$O(n'/\log n)$-time work can be spread over $O(n'/\log n)$ steps, each of $O(1)$ time, that can be interleaved with other
operations not involving the chunk at hand. This is a quite mild requirement, so the technique is general
and can be applied to static compressed bitvectors other than RRR with the same guarantees.
Hence we believe that the following approach is of independent interest.

Theorem 4.5 The append-only bitvector supports Access, Rank, Select, and Append on a bitvector $\beta$ in $O(1)$
time. The total space is $nH_0(\beta) + o(n)$ bits, where $n = |\beta|$.

Before describing the data structure and proving Theorem 4.5, we need to introduce some auxiliary lemmas.

Lemma 4.6 (Small Bitvectors) Let $\beta'$ be a bitvector of bounded size $n' = O(\mathrm{polylog}(n))$. Then there is a
data structure that supports Access, Rank, Select, and Append on $\beta'$ in $O(1)$ time, while occupying $O(\mathrm{polylog}(n))$
bits.

Proof It is sufficient to store explicitly all the answers to the queries Rank and Select in arrays of $n'$ elements,
thus taking $O(n' \log n') = O(\mathrm{polylog}(n))$ bits. Append can be supported in constant time by keeping a running count
of the 1s in the bitvector and the positions of the last 0 and the last 1, which are sufficient to compute the answers to the
Rank and Select queries for the appended bit. ■
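
As an illustration of the proof, the following sketch tabulates every Rank and Select answer and extends the tables on each Append. It is a simplification under stated assumptions: Python lists stand in for preallocated arrays of $n'$ entries, and the class name `SmallBitvector` and its field names are hypothetical.

```python
class SmallBitvector:
    """Explicit answer tables for a short bitvector (sketch of Lemma 4.6)."""

    def __init__(self):
        self.bits = []
        self.ones = [0]          # ones[i] = Rank(1, i): number of 1s in bits[:i]
        self.where = ([], [])    # where[b][k] = Select(b, k), with k counted from 0

    def append(self, b):
        # The running count of 1s and the current length are enough
        # to compute the table entries for the appended bit.
        self.where[b].append(len(self.bits))
        self.bits.append(b)
        self.ones.append(self.ones[-1] + b)

    def access(self, pos):
        return self.bits[pos]

    def rank(self, b, pos):
        return self.ones[pos] if b else pos - self.ones[pos]

    def select(self, b, idx):
        return self.where[b][idx]
```

Every query is a single table lookup, and the tables hold $O(n')$ integers of $O(\log n')$ bits each, matching the space bound in the proof.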
Lemma 4.7 (Amortized constant-time) There is a data structure that supports Access, Rank, and Select
in $O(1)$ time and Append in amortized $O(1)$ time on a bitvector $\beta$ of $n$ bits. The total space occupancy is
$nH_0(\beta) + o(n)$ bits.

Proof We split the input bitvector $\beta$ into $t$ smaller bitvectors $V_t, V_{t-1}, \ldots, V_1$, such that $\beta$ is equal to the
concatenation $V_t \cdot V_{t-1} \cdots V_1$ at any time. Let $n_i = |V_i| \ge 0$ be the length of $V_i$, and $m_i$ be the number of 1s in it,
so that $\sum_{i=1}^{t} m_i = m$ and $\sum_{i=1}^{t} n_i = n$. Following Overmars's logarithmic method [21], we maintain a collection
of static data structures on $V_t, V_{t-1}, \ldots, V_1$ that are periodically rebuilt.

(a) A data structure $F_1$ as described in Lemma 4.6 to store $\beta' = V_1$. Space is $O(\mathrm{polylog}(n))$ bits.

(b) A collection of static data structures $F_t, F_{t-1}, \ldots, F_2$, where each $F_i$ stores $V_i$ using RRR. Space occupancy is
$nH_0(\beta) + o(n)$ bits.

(c) Fusion Trees [10] of constant height storing the partial sums on the number of 1s, $s_i^1 = \sum_{j=t}^{i+1} m_j$, where $s_t^1 = 0$,
and symmetrically the partial sums on the number of 0s, $s_i^0 = \sum_{j=t}^{i+1} (n_j - m_j)$, setting $s_t^0 = 0$. Predecessor
takes $O(1)$ time and construction is $O(t)$ time. Space occupancy is $O(t \log n) = o(n)$ bits.

We fix $r = c \log n_0$ for a suitable constant $c > 1$, where $n_0$ is the length $n \ge 2$ of the initial input bitvector
$\beta$. We keep this choice of $r$ until $F_t$ is reconstructed: at that point, we set $n_0$ to the current length of $\beta$ and we
update $r$ consistently. Based on this choice of $r$, we guarantee that $r = O(\log n)$ at any time and introduce the
following constraints: $n_1 \le r$ and, for every $i > 1$, $n_i$ is either 0 or $2^{i-2} r$. It follows immediately that $t = O(\log n)$,
and hence the Fusion Trees in (c) contain $O(\log n)$ entries, thus guaranteeing constant height.

We now discuss the query operations. Rank(b, pos) and Select(b, idx) are performed as follows for a bit $b \in \{0, 1\}$.
Using the data structure in (c), we identify the corresponding bitvector $V_i$ along with the number of occurrences
of bit $b$ in the preceding ones, $V_t, \ldots, V_{i+1}$. The returned value corresponds to the index $i$ of $F_i$, which we query,
combining the result with $s_i^b$: we output the sum of $s_i^b$ and the result of the query Rank(b, pos $- \sum_{j=t}^{i+1} n_j$) on $F_i$ in
the former case; we output the result of the query Select(b, idx $- s_i^b$) on $F_i$ in the latter. Hence, the cost is $O(1)$ time.

It remains to show how to perform the Append(b) operation. While $n_1 < r$ we just append the bit $b$ to $F_1$, which
takes constant time by Lemma 4.6. When $n_1$ reaches $r$, let $j$ be the smallest index such that $n_j = 0$. Then
$\sum_{i=1}^{j-1} n_i = 2^{j-2} r$, so we concatenate $V_{j-1} \cdots V_1$ and rename this concatenation $V_j$ (no collision occurs, since $n_j = 0$).
We then rebuild $F_j$ on $V_j$ and set $F_i$ for $i < j$ to empty (updating $n_j, \ldots, n_1$). We also rebuild the Fusion Trees
of (c), which takes an additional $O(\log n)$ time. When $F_t$ is rebuilt, the new $V_t$ corresponds to the
whole current bitvector $\beta$, since $V_{t-1}, \ldots, V_1$ are empty. We thus set $n_0 := |\beta|$ and update $r$ accordingly. By
observing that each $F_j$ is rebuilt every $O(n_j)$ Append operations and that RRR construction time is $O(n_j/\log n)$,
it follows that each Append is charged $O(1/\log n)$ time on each $F_j$, thus totaling $O(t/\log n) = O(1)$ time. ■
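
The rebuilding schedule of this proof can be sketched as follows. This is a simplified illustration, not the structure of the lemma itself: a prefix-sum table (the hypothetical class `Static`) stands in for RRR, a linear scan over the segments replaces the Fusion Trees, and $r$ is kept fixed instead of being updated when $F_t$ is rebuilt. In the code, `buf` plays the role of $V_1$ and `levels[k]` the role of $V_{k+2}$.

```python
from itertools import accumulate

class Static:
    """Stand-in for a static RRR bitvector: immutable bits plus prefix sums."""

    def __init__(self, bits):
        self.bits = tuple(bits)
        self.pre = (0, *accumulate(self.bits))

    def __len__(self):
        return len(self.bits)

    def rank1(self, pos):
        return self.pre[pos]

class AppendOnlyBitvector:
    def __init__(self, r=4):
        self.r = r
        self.buf = []        # V_1: small mutable buffer with fewer than r bits
        self.levels = []     # levels[k] = V_{k+2}: empty or exactly 2**k * r bits

    def append(self, b):
        self.buf.append(b)
        if len(self.buf) < self.r:
            return
        j = 0                # smallest empty level
        while j < len(self.levels) and len(self.levels[j]) > 0:
            j += 1
        if j == len(self.levels):
            self.levels.append(Static([]))
        # Concatenate the lower segments, oldest first, then the buffer.
        merged = [bit for lvl in reversed(self.levels[:j]) for bit in lvl.bits]
        self.levels[j] = Static(merged + self.buf)
        for k in range(j):
            self.levels[k] = Static([])
        self.buf = []

    def rank(self, b, pos):
        ones, seen = 0, 0
        for seg in reversed(self.levels):        # oldest segment first
            take = min(len(seg), pos - seen)
            ones += seg.rank1(take)
            seen += take
        ones += sum(self.buf[:pos - seen])
        return ones if b else pos - ones
```

A segment of $2^k r$ bits is rebuilt only once every $2^k r$ appends, which is the source of the amortized bound; the sketch does not attempt to reproduce the $O(n_j/\log n)$ construction time of RRR.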



We now show how to de-amortize the data structure in Lemma 4.7. In the de-amortization we have to keep
copies of some bitvectors, so the $nH_0$ term becomes $O(nH_0)$.

Lemma 4.8 (Redundancy) There is a data structure that supports Access, Rank, Select, and Append in $O(1)$
time on a bitvector $\beta$ of $n$ bits. The total space occupancy is $O(nH_0(\beta)) + o(n)$.

Proof To de-amortize the structure we follow Overmars's classical method of partial rebuilding [21]. The idea is to
spread the construction of the RRR's $F_j$ over the next $O(n_j)$ Append operations, charging extra $O(1)$ time each.
We already saw in Lemma 4.7 that this suffices to cover all the costs. Moreover, we need to increase the speed of
construction of $F_j$ by a suitable constant factor with respect to the speed of arrival of the Append operations, so that
we are guaranteed that the construction of $F_j$ is completed before the next construction of $F_j$ is required by the
argument shown in the proof of Lemma 4.7. We refer the reader to [21] for a thorough discussion of the technical
details of this general technique.

While $V_1$ fills up to its bound of $r$ bits, we have a budget of $O(r) = O(\log n)$ operations that we can use to
prepare the next version of the data structure. We use this budget to perform the following operations.

(1) Identify the smallest $j$ such that $n_j = 0$ and start the construction of $F_j$ by creating a proxy bitvector $\tilde{F}_j$
which references the existing $F_{j-1}, \ldots, F_1$ and Fusion Trees in (c), so that it can answer queries in $O(1)$ time
as if it were the fully built $F_j$. When we switch to this version of the data structure, these $F_{j-1}, \ldots, F_1$ become
accessible only inside $\tilde{F}_j$.

(2) Build the Fusion Trees in (c) for the next reconstruction of the data structure. Note that this would require
knowing the final values of the $n_i$s and $m_i$s when $V_1$ is full and the reconstruction starts. Instead, we use the
current values of $n_i$ and $m_i$: only the values for the last non-empty segment will be wrong. We can correct the
Fusion Trees by adding a correction value to the last non-empty segment; applying the correction
at query time has constant-time overhead.

(3) Build a new version of the data structure which references the new Fusion Trees, the existing bitvectors
$F_t, \ldots, F_{j+1}$, the proxy bitvector $\tilde{F}_j$ and new empty bitvectors $F_{j-1}, \ldots, F_1$ (hence, $n_j = 2^{j-2} r$ and
$n_{j-1} = \cdots = n_1 = 0$).

When $n_1$ reaches $r$, we can replace in constant time the data structure with the one that we just finished
rebuilding.

At each Append operation, we use an additional $O(1)$ budget to advance the construction of the $F_j$s from the
proxies $\tilde{F}_j$s in a round-robin fashion. When the construction of one $F_j$ is done, the proxy $\tilde{F}_j$ is discarded and
replaced by $F_j$. Since, by the amortization argument in the proof of Lemma 4.7, each $F_j$ is completely rebuilt by
the time it has to be set to empty (and thus used for the next reconstruction), at most one copy of each bitvector
has to be kept, thus the total space occupancy grows from $nH_0(\beta) + o(n)$ to $O(nH_0(\beta)) + o(n)$. Moreover, when $r$
has to increase (and thus the $n_j$'s should be updated), we proceed as in [21]. ■

We can now use the de-amortized bitvector to bootstrap a constant-time append-only bitvector with space
occupancy $nH_0(\beta) + o(n)$, thus proving Theorem 4.5.

Proof (of Theorem 4.5) Let $\beta$ be the input bitvector, and let $L = O(\mathrm{polylog}(n))$ be a power of two. We split $\beta$ into
$n_L = \lfloor n/L \rfloor$ smaller bitvectors $B_i$, each of length $L$ and with $\hat{m}_i$ 1s ($0 \le \hat{m}_i \le L$), plus a residual bitvector $B'$
of length $0 \le |B'| < L$: at any time $\beta = B_1 \cdot B_2 \cdots B_{n_L} \cdot B'$. Using this partition, we maintain the following data
structures:

(1) A collection $F_1, F_2, \ldots, F_{n_L}$ of static data structures, where each $F_i$ stores $B_i$ using RRR.

(2) The data structure in Lemma 4.6 to store $B'$.

(3) The data structure in Lemma 4.8 to store the partial sums $\hat{s}_i^1 = \sum_{j=1}^{i-1} \hat{m}_j$, setting $\hat{s}_1^1 = 0$. This is implemented
by maintaining a bitvector that has a 1 for each position $\hat{s}_i^1$, and 0 elsewhere. Predecessor queries can be
implemented by composing Rank and Select. The bitvector has length $n_L + m$ and contains $n_L$ 1s. The partial
sums $\hat{s}_i^0 = \sum_{j=1}^{i-1} (L - \hat{m}_j)$ are kept symmetrically in another bitvector.

Rank(b, pos) and Select(b, idx) are implemented as follows for a bit $b \in \{0, 1\}$. Using the data structure in (3),
we identify the corresponding bitvector $B_i$ in (1) or $B'$ in (2), along with the number of occurrences of bit $b$ in
the preceding segments. In both cases, we query the corresponding dictionary and combine the result with $\hat{s}_i^b$.
These operations take $O(1)$ time.

Now we focus on Append(b). At every Append operation, we append a 0 to one of the bitvectors in (3),
depending on whether $b$ is 0 or 1, thus maintaining the partial sums invariant. This takes constant time. We
guarantee that $|B'| < L$ bits: whenever $|B'| = L$, we conceptually create $B_{n_L+1} := B'$, still keeping its data
structure in (2); we reset $B'$ to be empty, creating the corresponding data structure in (2); and we append a 1 to the bitvectors
in (3). We start building a new static compressed data structure $F_{n_L+1}$ for $B_{n_L+1}$ using RRR in $O(L/\log n)$ steps
of $O(1)$ time each. During the construction of $F_{n_L+1}$ the old $B'$ is still valid, so it can be used to answer the
queries. As soon as the construction is completed, in $O(L/\log n)$ time, the old $B'$ can be discarded and queries
can now be handled by $F_{n_L+1}$. Meanwhile the newly appended bits are handled in the new $B'$, in $O(1)$ time each,
using its new instance of (2). By suitably tuning the speed of the operations, we can guarantee that by the time
the new reset $B'$ has reached $L/2$ (appended) bits, the above $O(L)$ steps have been completed for $F_{n_L+1}$. Hence,
the total cost of Append is just $O(1)$ time in the worst case.

To complete our proof, we have to discuss what happens when we have to double $L := 2 \times L$. This is a standard
task known as global rebuilding [21]. We rebuild RRR for the concatenation of $B_1$ and $B_2$, and deallocate the
latter two after the construction; we then continue with RRR on the concatenation of $B_3$ and $B_4$, and deallocate
them after the construction, and so on. Meanwhile, we build a copy (3') of the data structure in (3) for the new
parameter $2 \times L$, following an incremental approach. At any time, we only have (3') and $F_{2i-1}, F_{2i}$ duplicated.
The implementation of Rank and Select needs a minor modification to deal with the already rebuilt segments.
The global rebuilding is completed before we need to double the value of $L$ again.

We now perform the space analysis. As for (1), we have to add up the space taken by $F_1, \ldots, F_{n_L}$ plus that taken
by the one being rebuilt using $F_{2i-1}, F_{2i}$. This sum can be upper bounded by $\sum_{i=1}^{n_L} (B(\hat{m}_i, L) + o(L)) + O(L) =
nH_0(\beta) + o(n)$. The space for (2) is $O(\mathrm{polylog}(n)) = o(n)$. Finally, the occupancy of the $\hat{s}_i^1$ partial sums in (3) is
$B(n_L, n_L + m) + o(n_L + m) = O(n_L \log(1 + m/n_L)) = O(n \log n / L) = o(n)$ bits, since the bitvector has length
$n_L + m$ and contains $n_L$ 1s. The analysis is symmetric for the $\hat{s}_i^0$ partial sums, and for the copies in (3'). ■

4.2 Fully dynamic bitvectors

We introduce a new dynamic bitvector construction which, although the entropy term has a constant greater than
1, supports logarithmic-time Init and Insert/Delete.

To support both insertion/deletion and initialization in logarithmic time we adapt the dynamic bitvector
presented in Section 3.4 of [18]; in that paper, the bitvector is compressed using Gap Encoding, i.e. the bitvector
$0^{g_0} 1 0^{g_1} 1 \ldots$ is encoded as the sequence of gaps $g_0, g_1, \ldots$, and the gaps are encoded using Elias delta code [5].
The resulting bit stream is split into chunks of size $\Theta(\log n)$ (without breaking the codes) and a self-balancing binary
search tree is built on the chunks, with partial counts in all the nodes. Chunks are split and merged upon insertions
and deletions to maintain the chunk size invariant, and the tree is rebalanced.

Because of gap encoding, the space has a linear dependence on the number of 1s, hence by Remark 4.2 it is not
suitable for our purposes. We make a simple modification that enables an efficient Init: in place of gap encoding
and delta codes we use RLE and Elias gamma codes [5], as the authors of [9] do in their practical dictionaries.
RLE encodes the bitvector $0^{r_0} 1^{r_1} 0^{r_2} 1^{r_3} \ldots$ with the sequence of runs $r_0, r_1, r_2, r_3, \ldots$ The runs are encoded with
Elias gamma codes. Init(b, n) can be trivially supported by creating a tree with a single leaf node, and encoding a
run of $n$ bits $b$ in the node, which can be done in time $O(\log n)$. In [5] it is proven that the space of this encoding is
bounded by $O(nH_0)$; even though the coefficient of the entropy term is not 1 as in RRR bitvectors, the experimental
analysis performed in [9] shows that RLE bitvectors perform extremely well in practice. The rest of the data
structure is left unchanged; we refer to [18] for the details.
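
To make the encoding concrete, here is a small sketch of RLE with Elias gamma codes on a single chunk, showing why Init(b, n) only needs to write $O(\log n)$ bits. It makes one assumption that is not in the text: since gamma codes cannot represent 0, the value of the first bit is stored explicitly so that every run has length at least 1. The function names are hypothetical, and the balanced tree over chunks is omitted.

```python
from itertools import groupby

def gamma_encode(x):
    """Elias gamma code of an integer x >= 1, as a '0'/'1' string."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode(code, i):
    """Decode one gamma code starting at index i; return (value, next index)."""
    zeros = 0
    while code[i + zeros] == "0":
        zeros += 1
    end = i + 2 * zeros + 1
    return int(code[i + zeros:end], 2), end

def rle_encode(bits):
    """First bit, then the gamma codes of the run lengths (bits must be non-empty)."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return str(bits[0]) + "".join(gamma_encode(r) for r in runs)

def rle_decode(code):
    bit, i, out = int(code[0]), 1, []
    while i < len(code):
        run, i = gamma_decode(code, i)
        out.extend([bit] * run)
        bit ^= 1                 # runs alternate between 0s and 1s
    return out

def init(b, n):
    """Init(b, n): a constant bitvector is a single run of length n."""
    return str(b) + gamma_encode(n)
```

For example, `init(1, 10**6)` produces a 40-character code, whereas a gap-encoded representation of the same bitvector would need one code per 1.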

Theorem 4.9 The dynamic RLE+$\gamma$ bitvector supports Access, Rank, Select, Insert, Delete, and Init on a bitvector
$\beta$ in $O(\log n)$ time. The total space occupancy is $O(nH_0(\beta) + \log n)$ bits.

5 Other Query Algorithms 

In this section we describe range query algorithms on the Wavelet Trie that can be useful in particular in database
applications and analytics. We note that the algorithms for distinct values in range and range majority element
are similar to the report and range quantile algorithms presented in [11]; we restate them here for completeness,
extending them to prefix operations. In the following we denote by $C_{op}$ the cost of Access/Rank/Select on the
bitvectors; $C_{op}$ is $O(1)$ for static and append-only bitvectors, and $O(\log n)$ for fully dynamic bitvectors.

Sequential access. Suppose we want to enumerate all the strings in the range $[l, r)$, i.e. $S^{[l,r)} = s_l, \ldots, s_{r-1}$. We
could do it with $r - l$ calls to Access, but accessing each string $s_i$ would cost $O(|s_i| + h_{s_i} C_{op})$. We show instead
how to enumerate the values of a range by enumerating the bits of each bitvector: suppose we have an iterator on the
root bitvector for the range $[l, r)$. If the current bit is 0, the next value is the next value given by the left
subtree, while if it is 1 the next value is the next value of the right subtree. We proceed recursively by keeping an
iterator on all the bitvectors of the internal nodes we traverse during the enumeration.

When we traverse an internal node for the first time, we perform a Rank to find the initial point, and create
an iterator. The next time we traverse it, we just advance the iterator. Note that both RRR bitvectors and RLE
bitvectors can support iterators with $O(1)$ advance to the next bit.

By using iterators instead of performing a Rank each time we traverse a node, a single Rank is needed for each
traversed node, hence extracting the $i$-th string takes $O(|s_i| + \frac{1}{r-l} \sum_{s \in S_{set}^{[l,r)}} h_s C_{op})$ amortized time.

Note that if $S_{set}^{[l,r)}$ (the set of distinct strings occurring in $S^{[l,r)}$) is large, the actual time is smaller due to
shared nodes in the paths. In fact, in the extreme case when the whole string set occurs in the range, we can give
a better bound, $O(|s_i| + \frac{1}{r-l} |S_{set}| C_{op})$ amortized time.
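
The iterator-based enumeration can be sketched as follows, under simplifying assumptions: the trie is built statically from a prefix-free list of '0'/'1' strings, bitvectors are plain Python lists, and the helper names `build` and `scan` are hypothetical. Each internal node gets a cursor that is initialized by one Rank on its parent the first time the node is reached and is only advanced afterwards.

```python
import os

def build(seq):
    """Static Wavelet Trie of a prefix-free list of '0'/'1' strings, as nested dicts."""
    alpha = os.path.commonprefix(seq)
    if all(s == alpha for s in seq):
        return {"alpha": alpha}                          # leaf
    k = len(alpha)
    kids = [build([s[k + 1:] for s in seq if s[k] == c]) for c in "01"]
    return {"alpha": alpha, "bits": [int(s[k]) for s in seq], "kids": kids}

def scan(root, l, r):
    """Yield S[l], ..., S[r-1] using one Rank per visited internal node."""
    cursor = {}                                          # id(node) -> next bit position

    def walk(node, first_pos):
        if "bits" not in node:
            return node["alpha"]
        key = id(node)
        if key not in cursor:
            cursor[key] = first_pos()                    # the only Rank for this node
        p = cursor[key]
        cursor[key] = p + 1                              # advance the iterator
        b = node["bits"][p]
        rank_in_parent = lambda: node["bits"][:p].count(b)
        return node["alpha"] + str(b) + walk(node["kids"][b], rank_in_parent)

    for _ in range(r - l):
        yield walk(root, lambda: l)
```

The cursor of a child advances by exactly one each time its parent emits the child's bit, so consecutive visits read consecutive positions without any further Rank.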

Distinct values in range. Another useful query is the enumeration of the distinct values in the range $[l, r)$,
which we called $S_{set}^{[l,r)}$. Note that for each node the distinct values of the subsequence represented by the node
are just the distinct values of the left subtree plus the distinct values of the right subtree in the corresponding
ranges. Hence, starting at the root, we compute the number of 0s in the range $[l, r)$ with two calls to Rank. If
there are no 0s we just enumerate the distinct elements of the right child in the range [Rank(1, l), Rank(1, r)).
If there are no 1s, we proceed symmetrically. If there are both 0s and 1s, the distinct values are the union of
the distinct values of the left child in the range [Rank(0, l), Rank(0, r)) and those of the right child in the range
[Rank(1, l), Rank(1, r)). Since we only traverse nodes that lead to values that are in the range, the total running
time is $O(\sum_{s \in S_{set}^{[l,r)}} (|s| + h_s C_{op}))$, which is the same time as accessing the values, if we knew their positions. As a
byproduct, we also get the number of occurrences of each value in the range.

We can stop early in the traversal, hence enumerating the distinct prefixes that satisfy some property. For
example, in a URL access log we can efficiently find the distinct hostnames in a given time range.
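
The distinct-values traversal can be sketched as a short recursive generator. As before this is only an illustration under assumptions: the trie is built statically from a prefix-free list of '0'/'1' strings, Rank is a linear scan on a Python list, and `build`, `rank` and `distinct` are hypothetical names.

```python
import os

def build(seq):
    """Static Wavelet Trie of a prefix-free list of '0'/'1' strings, as nested dicts."""
    alpha = os.path.commonprefix(seq)
    if all(s == alpha for s in seq):
        return {"alpha": alpha}                          # leaf
    k = len(alpha)
    kids = [build([s[k + 1:] for s in seq if s[k] == c]) for c in "01"]
    return {"alpha": alpha, "bits": [int(s[k]) for s in seq], "kids": kids}

def rank(bits, b, pos):
    return bits[:pos].count(b)

def distinct(node, l, r, prefix=""):
    """Yield (string, number of occurrences) for each distinct string in S[l:r]."""
    if l >= r:
        return
    prefix += node["alpha"]
    if "bits" not in node:
        yield prefix, r - l                              # width of the range at the leaf
        return
    for b in (0, 1):
        lo, hi = rank(node["bits"], b, l), rank(node["bits"], b, r)
        if lo < hi:                                      # only descend where b occurs
            yield from distinct(node["kids"][b], lo, hi, prefix + str(b))
```

The occurrence count comes for free as the width of the range that reaches each leaf. Stopping the recursion once `prefix` satisfies a chosen property would give the distinct-prefix variant mentioned above.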

Range majority element. The previous algorithm can be modified to check if there is a majority element in the
range (i.e. one element that occurs more than $\frac{r-l}{2}$ times in $[l, r)$), and, if there is such an element, find it. Start at
the root, and count the number of 0s and 1s in the range. If a bit $b$ occurs more than $\frac{r-l}{2}$ times (note that there
can be at most one) proceed recursively on the b-labeled subtree, otherwise there is no majority element in the
range.

The total running time is $O(h C_{op})$, where $h$ is the height of the Wavelet Trie. In case of success, if the string
found is $s$, the running time is just $O(h_s C_{op})$. As for the distinct values, this can be applied to prefixes as well by
stopping the traversal when the prefix found up to that point satisfies some property.
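
A sketch of the majority search follows, under the same simplifying assumptions as a plain-list prototype: the trie is built statically from a prefix-free list of '0'/'1' strings, Rank is a linear scan, and the names `build` and `range_majority` are hypothetical. Note that the threshold is computed once from the original range and kept fixed during the descent.

```python
import os

def build(seq):
    """Static Wavelet Trie of a prefix-free list of '0'/'1' strings, as nested dicts."""
    alpha = os.path.commonprefix(seq)
    if all(s == alpha for s in seq):
        return {"alpha": alpha}                          # leaf
    k = len(alpha)
    kids = [build([s[k + 1:] for s in seq if s[k] == c]) for c in "01"]
    return {"alpha": alpha, "bits": [int(s[k]) for s in seq], "kids": kids}

def range_majority(root, l, r):
    """Return the string occurring more than (r - l) / 2 times in S[l:r], or None."""
    node, prefix, half = root, "", (r - l) / 2
    while True:
        prefix += node["alpha"]
        if "bits" not in node:
            return prefix if r - l > half else None
        bits = node["bits"]
        for b in (0, 1):
            lo, hi = bits[:l].count(b), bits[:r].count(b)   # two Ranks per bit
            if hi - lo > half:                              # at most one bit qualifies
                node, l, r, prefix = node["kids"][b], lo, hi, prefix + str(b)
                break
        else:
            return None                                     # neither branch is heavy enough
```

Only one root-to-leaf path is followed, which is why the cost is bounded by the height of the trie rather than by the number of distinct values in the range.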

A similar algorithm can be used as a heuristic to find all the values that occur in the range at least $t$ times,
by proceeding as in the enumeration of distinct elements but discarding the branches whose bit has fewer than
$t$ occurrences in the parent. While no theoretical guarantees can be given, this heuristic should perform very
well with power-law distributions and high values of $t$, which are the cases of interest in most data analytics
applications.

6 Probabilistically-Balanced Dynamic Wavelet Trees 

In this section we show how to use the Wavelet Trie to maintain a dynamic Wavelet Tree on a sequence from a 
bounded alphabet with operations that with high probability do not depend on the universe size. 

A compelling example is given by numeric data: a sequence of integers, say in $\{0, \ldots, 2^{64} - 1\}$, cannot be
represented with existing dynamic Wavelet Trees unless the tree is built on the whole universe, even if the sequence
only contains integers from a much smaller subset. Similarly, a text sequence in Unicode typically contains a few
hundred distinct characters, far fewer than the $\approx 2^{17}$ (and growing) defined in the standard.

Formally, we wish to maintain a sequence of symbols $S = (s_0, \ldots, s_{n-1})$ drawn from an alphabet $\Sigma \subseteq U =
\{0, \ldots, u - 1\}$, where we call $U$ the universe and $\Sigma$ the working alphabet, with $\Sigma$ typically much smaller than $U$
and not known a priori. We want to support the standard Access, Rank, Select, Insert, and Delete, but we are
willing to give up RankPrefix and SelectPrefix, which would not make sense anyway for non-string data.

We can immediately use the Wavelet Trie on $S$, by mapping injectively the values of $U$ to strings of length
$\lceil \log u \rceil$. This supports all the required operations with a space bound that depends only logarithmically on $u$, but
the height of the resulting trie could be as much as $\log u$, while a balanced tree would require only $\log |\Sigma|$.

To obtain a balanced tree without having to deal with complex rotations we employ a simple randomized 
technique that will yield a balanced tree with high probability. The main idea is to randomly permute the universe 
U with an easy to compute permutation, such that the probability that the alphabet E will produce an unbalanced 
trie is negligibly small. 

To do so we use the hashing technique described in [4]. We map the universe $U$ onto itself by the function
$h_a(x) = ax \pmod{2^{\lceil \log u \rceil}}$, where $a$ is chosen at random among the odd integers in $[1, 2^{\lceil \log u \rceil} - 1]$ when the data
structure is initialized. Note that $h_a$ is a bijection, with the inverse given by $h_a^{-1}(x) = a^{-1} x \pmod{2^{\lceil \log u \rceil}}$. The
result of the hash function is considered as a binary string of $\lceil \log u \rceil$ bits written LSB-to-MSB, and operations on
the Wavelet Tree are defined by composition of the hash function with operations on the Wavelet Trie; in other
words, the values are hashed and inserted in a Wavelet Trie, and when retrieved the hash inverse is applied.
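
The permutation and its inverse are easy to compute, as the following sketch suggests. The helper name `make_hash` and the parameter `width` (standing for $\lceil \log u \rceil$) are hypothetical, and the three-argument `pow` with a negative exponent, used for the modular inverse, requires Python 3.8 or later.

```python
def make_hash(a, width):
    """Return (h, h_inv, to_bits) for h_a(x) = a * x mod 2**width, with a odd."""
    assert a % 2 == 1 and 0 < a < (1 << width)
    mask = (1 << width) - 1
    a_inv = pow(a, -1, 1 << width)       # inverse exists because a is odd

    def h(x):
        return (a * x) & mask

    def h_inv(y):
        return (a_inv * y) & mask

    def to_bits(y):
        # Binary string of `width` bits, least significant bit first.
        return "".join(str((y >> i) & 1) for i in range(width))

    return h, h_inv, to_bits
```

A value `x` would then be stored in the Wavelet Trie as `to_bits(h(x))`, and a retrieved string converted back with `h_inv`. All hashed strings have the same length, so the set is prefix-free.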

To prove that the resulting trie is balanced we use the following lemma from [4].

Lemma 6.1 ([4]) Let $\Sigma \subseteq U$ be any subset of the universe, and $\ell = \lceil (\alpha + 2) \log |\Sigma| \rceil$, so that $\ell \le \lceil \log u \rceil$. Then
the following holds:

$$\mathrm{Prob}\left(\forall x, y \in \Sigma,\ x \ne y:\ h_a(x) \not\equiv h_a(y) \pmod{2^{\ell}}\right) \ge 1 - |\Sigma|^{-\alpha}$$

where the probability is on the choice of $a$.

In our case, the lemma implies that with very high probability the hashes of the values in $\Sigma$ are distinguished
by the first $\ell$ bits, where $\ell$ is logarithmic in $|\Sigma|$. The trie on the hashes cannot be higher than $\ell$, hence it is
balanced. The space occupancy is that of the Wavelet Trie built on the hashes. We can bound $L$, the sum of trie
labels in Theorem 3.6, by the total sum of the hash lengths, hence proving the following theorem.

Theorem 6.2 The randomized Wavelet Tree on a dynamic sequence $S = (s_0, \ldots, s_{n-1})$ where $s_i \in \Sigma \subseteq U =
\{0, \ldots, u - 1\}$ supports the operations Access, Rank, Select, Insert, and Delete in time $O(\log u + h \log n)$, where
$h \le (\alpha + 2) \log |\Sigma|$ with probability $1 - |\Sigma|^{-\alpha}$ (and $h \le \lceil \log u \rceil$ in the worst case).
The total space occupancy is $O(nH_0(S) + |\Sigma| w) + |\Sigma| \log u$ bits.

7 Conclusion and Future Work 

We have presented the Wavelet Trie, a new data structure for maintaining compressed sequences of strings with 
provable time and compression bounds. We believe that the Wavelet Trie will find application in real-world storage 
problems where space-efficiency is crucial. To this end, we plan to evaluate the practicality of the data structure
with an experimental analysis on real-world data, evaluating several performance/space/functionality trade-offs.
We are confident that a properly engineered implementation can perform well, as other algorithm-engineered
succinct data structures have proven very practical ([20, 1, 14]).

It would also be interesting to balance the Wavelet Trie, even for pathological sets of strings. In [14] it was
shown that in practice the cost of unbalancedness can be high. Lastly, it is an open question how the Wavelet Trie
would perform in external or cache-oblivious models. A starting point would be a fanout larger than 2 in the trie,
but internal nodes would require vectors with non-binary alphabet. The existing non-binary dynamic sequences do
not directly support Init, hence they are not suitable for the Wavelet Trie.

APPENDIX

A Multiple static bitvectors 

Lemma A.1 Let $S$ be a sequence of length $n$ on an alphabet of cardinality $\sigma$, with each symbol of the alphabet
occurring at least once. Then the following holds:

$$nH_0(S) \ge (\sigma - 1) \log n.$$

Proof The inequality is trivial when $\sigma = 1$. When there are at least two symbols, the minimum entropy is attained
when $\sigma - 1$ symbols occur once and one symbol occurs the remaining $n - \sigma + 1$ times. To show this, suppose
by contradiction that the minimum entropy is attained by a string where two symbols occur more than once,
occurring respectively $a$ and $b$ times. Their contribution to the entropy term is $a \log\frac{n}{a} + b \log\frac{n}{b}$. This contribution
can be written as $f(a)$ where

$$f(t) = t \log\frac{n}{t} + (b + a - t) \log\frac{n}{b + a - t},$$

but $f(t)$ has two strict minima in $1$ and $b + a - 1$ among the integers $1 \le t \le b + a - 1$, so the entropy term can be lowered
by making one of the symbols absorb all but one of the occurrences of the other, yielding a contradiction.

To prove the lemma, it is sufficient to see that the contribution to the entropy term of the $\sigma - 1$ singleton
symbols is $(\sigma - 1) \log n$. ■
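
As a sanity check (not a proof), the lemma can be verified by brute force for small parameters. The sketch below enumerates all ways of distributing $n$ occurrences among $\sigma$ symbols, confirms that the minimum of $nH_0$ is attained by $\sigma - 1$ singletons plus one heavy symbol, and compares it with $(\sigma - 1) \log n$. The function names are hypothetical and logarithms are in base 2.

```python
from math import log2

def n_h0(counts):
    """n * H_0 of a sequence whose symbols occur counts[0], counts[1], ... times."""
    n = sum(counts)
    return sum(c * log2(n / c) for c in counts)

def partitions(n, parts, smallest=1):
    """All non-decreasing tuples of `parts` positive integers that sum to n."""
    if parts == 1:
        if n >= smallest:
            yield (n,)
        return
    for first in range(smallest, n // parts + 1):
        for rest in partitions(n - first, parts - 1, first):
            yield (first, *rest)

def check(n, sigma):
    """Return (minimizing counts, minimum of n*H_0, lower bound of Lemma A.1)."""
    best = min(partitions(n, sigma), key=n_h0)
    return best, n_h0(best), (sigma - 1) * log2(n)
```

For instance, `check(12, 3)` returns the counts `(1, 1, 10)` as the minimizer, in agreement with the argument in the proof.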

Lemma A.2 $O(|S_{set}|)$ is bounded by $o(hn)$.

Proof It suffices to prove that $\frac{|S_{set}|}{hn}$ is asymptotic to 0 as $n$ grows. By Lemma 3.5 and Lemma A.1, and assuming $|S_{set}| \ge 2$,

$$\frac{|S_{set}|}{hn} \le \frac{|S_{set}|}{nH_0(S)} \le \frac{|S_{set}|}{(|S_{set}| - 1) \log n} \le \frac{2}{\log n},$$

which completes the proof. ■

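Lemma A.3 The sum of the redundancy of $\sigma$ RRR bitvectors of $m_1, \ldots, m_\sigma$ bits respectively, where $\sum_i m_i = m$,
can be bounded by

$$O\left(m \frac{\log\log\frac{m}{\sigma}}{\log\frac{m}{\sigma}} + \sigma\right).$$

Proof The redundancy of a single bitvector can be bounded by $c_1 \frac{m_i \log\log m_i}{\log m_i} + c_2$. Since $f(x) = \frac{x \log\log x}{\log x}$ is concave,
we can apply the Jensen inequality:

$$\frac{1}{\sigma} \sum_i \frac{m_i \log\log m_i}{\log m_i} \le \frac{m}{\sigma} \cdot \frac{\log\log\frac{m}{\sigma}}{\log\frac{m}{\sigma}}.$$

The result follows by multiplying both sides by $\sigma$. ■
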
Lemma A.4 The redundancy of the RRR bitvectors in $WT(S)$ can be bounded by $o(hn)$.

Proof Since the bitvector lengths add up to $hn$, we can apply Lemma A.3 and obtain that the redundancy is
bounded by

$$O\left(hn \frac{\log\log\frac{hn}{|S_{set}|}}{\log\frac{hn}{|S_{set}|}} + |S_{set}|\right).$$

The term in $|S_{set}|$ is already taken care of by Lemma A.2. It suffices then to prove that

$$\frac{\log\log\frac{hn}{|S_{set}|}}{\log\frac{hn}{|S_{set}|}}$$

is negligible as $n$ grows, and because $f(x) = \frac{\log\log x}{\log x}$ is asymptotic to 0, we just need to prove that $\frac{hn}{|S_{set}|}$ grows to
infinity as $n$ does. Using again Lemma 3.5 and Lemma A.1, and assuming $|S_{set}| \ge 2$, we obtain that

$$\frac{hn}{|S_{set}|} \ge \frac{nH_0(S)}{|S_{set}|} \ge \frac{(|S_{set}| - 1) \log n}{|S_{set}|} \ge \frac{\log n}{2},$$

thus proving the lemma. ■

Lemma A.5 The partial sum data structure used to delimit the RRR bitvectors in $WT(S)$ occupies $o(hn)$ bits.

Proof By Lemma A.4 the sum of the RRR encodings is $nH_0(S) + o(hn)$. To encode the $|S_{set}|$ delimiters, the
partial sum structure of [22] takes

$$|S_{set}| \log\frac{nH_0(S) + o(hn)}{|S_{set}|} + O(|S_{set}|) \le |S_{set}| \log\frac{nH_0(S)}{|S_{set}|} + |S_{set}| \log\frac{o(hn)}{|S_{set}|} + O(|S_{set}|).$$

The third term is negligible by Lemma A.2. The second is negligible as well, as can be seen by dividing by $hn$ and noting that $f(x) = \frac{\log x}{x}$ is
asymptotic to 0. It remains to show that the first term is $o(hn)$. Dividing by $hn$ and using again Lemma 3.5 we
obtain

$$\frac{|S_{set}| \log\frac{nH_0(S)}{|S_{set}|}}{hn} \le \frac{|S_{set}| \log\frac{nH_0(S)}{|S_{set}|}}{nH_0(S)} = \frac{\log\frac{nH_0(S)}{|S_{set}|}}{\frac{nH_0(S)}{|S_{set}|}}.$$

By using again that $f(x) = \frac{\log x}{x}$ is asymptotic to 0 and proving as in Lemma A.4 that $\frac{nH_0(S)}{|S_{set}|}$ grows to infinity as
$n$ does, the result follows.

B Dynamic Patricia Trie 

For the dynamic Wavelet Trie we use a straightforward Patricia Trie data structure. Each node contains two
pointers to the children, one pointer to the label and one integer for its length. For $k$ strings, this amounts to
$O(kw)$ space. Given this representation, all navigational operations are trivial. The total space is $O(kw) + |L|$,
where $L$ is the concatenation of the labels in the compacted trie as defined in Theorem 3.6.

Insertion of a new string $s$ splits an existing node, where the mismatch occurs, and adds a leaf. The label of
the new internal node is set to point to the label of the split node, with the new label length (corresponding to the
mismatch of $s$ in the split node). The split node is modified accordingly. A new label is allocated with the suffix of
$s$ starting from the mismatch, and assigned to the new leaf node. This operation takes $O(|s|)$ time and the space
grows by $O(w)$ plus the length of the new suffix, hence maintaining the space invariant.

When a string is deleted, its leaf is deleted and the parent node and the other child of the parent need to
be merged. The highest node that shares the label with the deleted leaf is found, and the label is deleted and
replaced with a new string that is the concatenation of the labels from that node up to the merged node. The
pointers in the path from the found node to the merged node are replaced accordingly. This operation takes $O(\ell)$ time,
where $\ell$ is the length of the longest string in the trie, and the space invariant is maintained.